Skip to content

Fixing porter can strand parcels until daemon restart#2132

Open
sergey3bv wants to merge 2 commits into
lightninglabs:mainfrom
sergey3bv:fix/strand-parcels
Open

Fixing porter can strand parcels until daemon restart#2132
sergey3bv wants to merge 2 commits into
lightninglabs:mainfrom
sergey3bv:fix/strand-parcels

Conversation

@sergey3bv

Copy link
Copy Markdown
Contributor

Should close #2120

@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where sweep parcels could become stranded in the chain porter if a failure occurred after the initial response was sent to the caller. It introduces a robust retry mechanism for specific post-delivery states and optimizes how sweeper-originated transactions interact with the porter to prevent unnecessary broadcasts.

Highlights

  • Post-delivery failure recovery: Implemented a retry mechanism for recoverable post-delivery failures in the chain porter, allowing stranded parcels to be requeued with exponential backoff instead of requiring a daemon restart.
  • Sweeper transaction optimization: Updated sweeper-originated channel transactions to skip redundant porter broadcasts while ensuring local addresses are still correctly imported.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@sergey3bv sergey3bv force-pushed the fix/strand-parcels branch from 214b5d6 to b7633c0 Compare May 13, 2026 12:55
@sergey3bv sergey3bv marked this pull request as ready for review May 13, 2026 12:55

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a retry mechanism for the chain porter to handle recoverable post-delivery failures, such as those occurring during transaction confirmation or proof transfer. It also refactors the state machine to ensure local addresses are imported even when anchor transaction broadcasts are skipped. Review feedback highlighted a potential deadlock in the advanceState goroutine when handling background parcels with unbuffered error channels. Other improvements suggested include clearing retry bookkeeping upon permanent failure to prevent memory leaks and implementing a maximum retry limit to avoid infinite loops.

Comment thread tapfreighter/chain_porter.go Outdated
Comment thread tapfreighter/chain_porter.go
Comment thread tapfreighter/chain_porter.go
@sergey3bv sergey3bv force-pushed the fix/strand-parcels branch 3 times, most recently from 97bc49d to e1b2466 Compare May 14, 2026 07:40

@kaldun-tech kaldun-tech left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change here is solid and addresses the failure case. Some minor suggestions related to improving documentation and adding a test case for shutdown

daveSweepTxHash := daveSweepBlocks[0].Transactions[1].TxHash()
daveSweepTxHash := resolveMinedTransferTxid(
t.t, dave, daveSweepBlocks[0],
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use more context on the design decisions here. The old code failed with RBF (replace-by-fee) because it assumed the coinbase transaction at index 0, and sweep transaction is at index 1 in the block. And there's only one non-coinbase transaction in the block. The RBF sweep transactions invalidated this, or if multiple transactions are mined in the same block.

The new approach with resolveMinedTransferTxid finds which transaction in the block tapd know is a transfer. So decouples the tests from block transaction ordering, RBF replacements, multiple transactions in a block and tapd timing issues.


closeStream, _, err := net.CloseChannel(local, chanPoint, false)
require.NoError(t.t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the close after the SubscribeSendEvents fixes a race condition:

  1. CloseChannel starts the close process
  2. tapd processes close, fires SendStateComplete event
  3. SubscribeSendEvents called — but event already fired
  4. waitForSendEvent blocks forever waiting for missed event

By subscribing first, all events from the close operation are captured. This implements the standard "subscribe before action" pattern

So it's good we could use more comments IMO

pollInterval = 200 * time.Millisecond
)

require.Eventually(t, func() bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will get us more detailed failure messages about the state and make the testing more robust

blockHashSet bool
blockHeight uint32
blockHeightHint uint32
pollInterval = 200 * time.Millisecond

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: 200 millis gets used in a few different spots, consider intorducing a constant

case SendStateStorePostAnchorTxConf:
fallthrough
case SendStateTransferProofs:
return true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: The fallthrough style is unusual. More idiomatic:

switch state {
case SendStateWaitTxConf, SendStateStorePostAnchorTxConf, SendStateTransferProofs:
    return true
default:
    return false
}

Or even simpler:

return state == SendStateWaitTxConf ||
    state == SendStateStorePostAnchorTxConf ||
    state == SendStateTransferProofs

if pkg == nil || pkg.OutboundPkg == nil ||
pkg.OutboundPkg.AnchorTx == nil {

return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invariant: we can only add to postDeliveryRetryAttempts if AnchorTx is non-nil, and only need to delete it in the same case


return nil, fmt.Errorf("unable to import local "+
"addresses: %w", err)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block imports the taproot output keys into lnd's wallet for outputs that belong to us. So the fix ensures addresses are always imported, regardless of who broadcasts.

It's a solid change. Think it would benefit from more comments.

"disk: %w", err)
}

ctx, cancel = p.WithCtxQuitNoTimeout()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gives us a new context for importLocalAddresses which is idempotent and can be retried. As opposed to the CtxBlocking context (line 2189) for LogPendingParcel which is a critical write to disk that must complete even during shutdown.

}
}

func TestAdvanceStatePermanentFailureClearsRetryBookkeeping(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The name suggests "permanent failure after max attempts". It actually tests non-recoverable state failure

t.Fatalf("did not expect pending parcel re-queue")
case <-time.After(100 * time.Millisecond):
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Consider adding a test case for process shutdown similar to
`func TestSchedulePostDeliveryRetryShutdownDuringDelay(t *testing.T) {
porter := newTestChainPorter()
pkg := newTestSendPackage(SendStateTransferProofs)

recoverable := porter.schedulePostDeliveryRetry(
    pkg, SendStateTransferProofs, errors.New("failure"),
)
require.True(t, recoverable)

// Close quit before timer fires (base delay is 1 second)
close(porter.Quit)

// Verify goroutine exits cleanly, no re-queue
select {
case <-porter.outboundParcels:
    t.Fatalf("did not expect parcel after shutdown")
case <-time.After(1500 * time.Millisecond):
    // Success
}

}`

@lightninglabs-deploy

Copy link
Copy Markdown

@sergey3bv, remember to re-request review from reviewers when ready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

Porter can strand parcels until daemon restart

3 participants